ROCm and HIP: A Detailed Tutorial in 10 Chapters: The Memory-Centric Nature of GPU Performance

In GPU acceleration, we must abandon "compute first" reasoning. Modern performance is dictated by memory management: the orchestration of data allocation, synchronization, and transfer optimization between the host (CPU) and the device (GPU).

1. The Imbalance Between Memory and Compute

While GPU arithmetic throughput (TFLOPS) has exploded, memory bandwidth (GB/s) has grown at a much slower rate. This creates a gap in which the execution units are often "starved", sitting idle while they wait for data from VRAM. As a result, GPU programming is often memory programming.
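The gap described above can be expressed as a "machine balance": the number of FLOPs a device can execute in the time it takes to fetch one byte. The sketch below is a minimal illustration in Python; the function name `machine_balance` and the device figures (40 TFLOPS, 1.6 TB/s) are hypothetical values chosen for the arithmetic, not the specifications of any particular GPU.

```python
def machine_balance(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """FLOPs the device can execute in the time needed to fetch one byte."""
    return peak_flops / bandwidth_bytes_per_s


# Hypothetical device figures, for illustration only (assumptions).
PEAK_FLOPS = 40e12   # 40 TFLOPS
BANDWIDTH = 1.6e12   # 1.6 TB/s of VRAM bandwidth

balance = machine_balance(PEAK_FLOPS, BANDWIDTH)

# Single-precision vector add, c[i] = a[i] + b[i]:
# 1 FLOP per element, with two 4-byte loads and one 4-byte store.
vector_add_intensity = 1 / 12

# Fraction of peak compute the kernel can use if memory is the only limit.
usable_fraction = vector_add_intensity / balance

print(f"machine balance:       {balance:.1f} FLOP/byte")
print(f"vector add intensity:  {vector_add_intensity:.3f} FLOP/byte")
print(f"usable share of peak:  {usable_fraction:.2%}")
```

Under these assumed figures, a vector add supplies far fewer FLOPs per byte than the device could absorb, so the arithmetic units spend most of their time waiting for data.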

2. The Roofline Model

This model illustrates the relationship between arithmetic intensity (FLOPs per byte, on the horizontal axis) and attainable performance (on the vertical axis). Applications generally fall into two categories:

Memory-bound: limited by memory bandwidth (the steep slope).
Compute-bound: limited by peak TFLOPS (the horizontal ceiling).
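The two regimes follow from the usual Roofline formula, $P = \min(P_{peak}, I \times B)$, where $I$ is arithmetic intensity and $B$ is memory bandwidth. The following is a minimal sketch, not a profiling tool: the helper names `attainable_flops` and `classify` are hypothetical, and the peak and bandwidth figures are assumed values used only to exercise the formula.

```python
def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline bound: the lower of the bandwidth slope and the compute ceiling."""
    return min(peak_flops, intensity * bandwidth)


def classify(intensity: float, peak_flops: float, bandwidth: float) -> str:
    """Label a kernel by which side of the ridge point its intensity falls on."""
    ridge_point = peak_flops / bandwidth  # intensity where slope meets ceiling
    return "memory-bound" if intensity < ridge_point else "compute-bound"


# Hypothetical device figures, for illustration only (assumptions).
PEAK_FLOPS = 40e12   # 40 TFLOPS
BANDWIDTH = 1.6e12   # 1.6 TB/s

# Sweep a few intensities (FLOPs per byte) across the ridge point.
for intensity in (0.5, 4.0, 25.0, 100.0):
    bound = attainable_flops(intensity, PEAK_FLOPS, BANDWIDTH)
    label = classify(intensity, PEAK_FLOPS, BANDWIDTH)
    print(f"I = {intensity:6.1f} FLOP/byte -> {bound / 1e12:5.1f} TFLOPS ({label})")
```

Kernels to the left of the ridge point gain performance by moving fewer bytes per FLOP; kernels to the right of it gain only from faster arithmetic.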

3. The Data Movement Tax

The main performance bottleneck rarely comes from the computation itself; it is the latency and energy cost of moving each byte across the PCIe bus or out of HBM. High-performance code favors data locality and minimizes transfers between host and device.
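A back-of-the-envelope estimate shows how large this tax can be. The sketch below compares the time to copy an array to the device and back against a lower bound on the time a simple scaling kernel needs once the data is resident. All names are hypothetical, and the bandwidth figures (32 GB/s for the host link, 1.6 TB/s for HBM, 40 TFLOPS peak) are assumed round numbers, not measurements; real transfers also pay a fixed latency that this model ignores.

```python
def transfer_seconds(num_bytes: float, link_bandwidth: float) -> float:
    """Idealized time to move num_bytes over a link (ignores fixed latency)."""
    return num_bytes / link_bandwidth


def kernel_seconds(mem_bytes: float, flops: float,
                   peak_flops: float, mem_bandwidth: float) -> float:
    """Lower bound on kernel time: the slower of memory traffic and arithmetic."""
    return max(mem_bytes / mem_bandwidth, flops / peak_flops)


# Hypothetical figures, for illustration only (assumptions).
PCIE_BANDWIDTH = 32e9    # host <-> device link, bytes/s
HBM_BANDWIDTH = 1.6e12   # device memory, bytes/s
PEAK_FLOPS = 40e12

n = 256 * 1024 * 1024    # number of float32 elements
array_bytes = n * 4      # 1 GiB

# Copy to the device, then copy the result back to the host.
copies = 2 * transfer_seconds(array_bytes, PCIE_BANDWIDTH)

# Scaling kernel y[i] = alpha * x[i]: one read, one write, one FLOP per element.
kernel = kernel_seconds(2 * array_bytes, n, PEAK_FLOPS, HBM_BANDWIDTH)

ratio = copies / kernel
print(f"host-device copies: {copies * 1e3:.2f} ms")
print(f"kernel lower bound: {kernel * 1e3:.2f} ms")
print(f"copies / kernel:    {ratio:.0f}x")
```

Under these assumptions the copies dominate the kernel by well over an order of magnitude, which is why keeping data resident on the device across several kernels usually pays off more than tuning the arithmetic.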


QUESTION 1

What is the primary cause of a GPU kernel being 'memory-bound'?

The clock speed of the GPU cores is too slow.

The rate of data delivery is slower than the rate of arithmetic execution.

There are too many threads running in parallel.

The CPU is faster than the GPU.

QUESTION 2

In the context of GPU programming, what does 'Memory Management' involve?

Only allocating variables on the CPU stack.

Controlling allocation, synchronization, and optimization of data transfer between host and device.

Optimizing the cache size of the L1 controller.

Manually cleaning the GPU registers after every kernel call.

QUESTION 3

Which axis of the Roofline Model represents 'Arithmetic Intensity'?

Vertical Axis (Y)

Horizontal Axis (X)

The slope of the line.

The area under the curve.

QUESTION 4

Why is redundant host-device transfer considered a 'performance tax'?

It consumes GPU registers.

Latency and energy consumption of moving data across PCIe is significantly higher than instruction execution.

It increases the floating-point precision error.

It causes the GPU to overheat instantly.

QUESTION 5

If a researcher's kernel spends 95% of its time 'stalled,' what is the most likely culprit?

The math instructions are too complex.

Inefficient orchestration of data residence causing the GPU to wait for data.

The GPU has too much VRAM.

The kernel was written in C++ instead of Python.